% Advanced FPGA Design - Exercise 2
% Norbert Tremurici (e11907086@student.tuwien.ac.at)
% 17th December, 2023

# Advanced FPGA Design

For this exercise, we implemented a real-time power estimation system.
To do this effectively, we estimate only dynamic power in real time, whereas static power is taken as a fixed value.
The system was developed in conjunction with the Vivado power report for the design, so the static power is taken directly from that report.

#### Estimation Strategy

Power estimation can be approached in many ways.
First, there are different kinds of power consumption: besides dynamic and static power, FPGAs also have I/O power consumption and short-circuit power consumption.

Short-circuit power is consumed when switching temporarily puts the transistors in a state where the high and low voltage rails are connected; this typically lasts only for the short switching window of the transistors.
Because this is not only difficult to estimate but also less relevant, we neglect this form of power consumption.
We also neglect I/O power consumption, because the task at hand is to estimate the power consumption of our design within the FPGA.

Static power, on the other hand, is caused by leakage, which becomes increasingly significant as transistors continue to shrink.
We cannot directly influence leakage with our design, so for static power consumption we use the value given by the Vivado power report for the design.

Dynamic power is the remaining component, and the one we can directly influence, because several of its parameters depend on our design.
For the `add` and `mul` parts of the ALU, the dynamic power can be calculated as follows:

$$
P_{dynamic} = \sum_{i \in \{\text{add}, \text{mul}\}} \alpha_{i} C_{i} V_{dd}^2 f
$$

As can be seen, there are several factors, some of which are constant, such as the supply voltage, the clock frequency the design actually uses and the capacitances.
The only time-varying factor is the switching activity $\alpha_{i}$ of each component.
The switching activity is a value in the range $[0, 1]$ and represents how often some part of the system (which could be a single signal, or an average over an entire subsystem) switches relative to the total number of transitions given by the clock.
How we estimate this factor is described in the Estimation System section of the report.
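As a numerical illustration of the formula above, the following minimal Python sketch evaluates $P_{dynamic}$ for two components. The function name `dynamic_power` and the activity factors are hypothetical and chosen only for the example; the voltage, frequency and capacitance are the values used elsewhere in this report.

```python
def dynamic_power(alphas, caps, vdd, freq):
    """Return sum_i alpha_i * C_i * Vdd^2 * f in watts."""
    if alphas.keys() != caps.keys():
        raise ValueError("alphas and caps must name the same components")
    for name, alpha in alphas.items():
        if not 0.0 <= alpha <= 1.0:
            raise ValueError(f"activity of {name!r} must lie in [0, 1]")
    return sum(alphas[n] * caps[n] * vdd ** 2 * freq for n in alphas)


# Hypothetical activity factors; C, Vdd and f are the values from this report.
p = dynamic_power(
    alphas={"add": 0.25, "mul": 0.5},
    caps={"add": 13.8e-12, "mul": 13.8e-12},
    vdd=1.0,
    freq=100e6,
)
print(f"P_dynamic = {p * 1e3:.3f} mW")
```

Since $V_{dd}$, $f$ and the capacitances are fixed, the result depends only on the activity factors, which is why only those have to be measured at run time.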

For the value of $V_{dd}$, the voltage reported in the Vivado power report was used, which is $1 V$.
This is just shy of the maximal internal voltage supported by the FPGA as listed in the datasheet, and it is convenient for us: the factor $V_{dd}^2$ becomes $1^2$, so we can drop this multiplication from our power calculation.

The frequency target of the design is set to the internally available clock frequency of $100 MHz$.

For the capacitances, we searched datasheets, presentations and any other material we could find, but the internal capacitance was not listed anywhere.
Instead, we worked backwards: we ran our power estimation system with all $C_{i}$ assumed to be $1 F$ in order to collect some preliminary values.
We then used the dynamic power listed in the Vivado power report to calculate the capacitance from the ratio between the reported dynamic power and the value produced by our estimation system.
This gives us a reasonable value that couples our estimation system to the results of the Vivado power report.

The estimated capacitance value is calculated in the simulations section below, but for completeness the value is also included here as $13.8 pF$.

#### Estimation System

![Simple Block Diagram of System\label{fig:block-diagram}](graphics/estimation-system.pdf)

Figure \ref{fig:block-diagram} shows the general setup of the entire system.
The dataflow begins with `opgen`, which is a unit responsible for the generation of data.
Here the unit enumerates possible ALU inputs as well as opcodes for the ALU.

The ALU is a simple 8-bit ALU that supports only two operations, (unsigned) addition and multiplication.
The result is reduced to two LEDs by an XOR reduction on each of its two halves.
More importantly, the results of both the adder and the multiplier circuits of the ALU are fed into counters that are used for the power estimation part of the circuit.
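The XOR reduction onto the two LEDs can be modelled in a few lines of Python. This is only a sketch: the 16-bit result width and the assignment of the upper half to the first LED are assumptions, and the names `xor_reduce` and `result_to_leds` are hypothetical.

```python
def xor_reduce(value, width):
    """Return the XOR of the lowest `width` bits of `value` (their parity)."""
    return bin(value & ((1 << width) - 1)).count("1") & 1


def result_to_leds(result, width=16):
    """Map a `width`-bit result to (led1, led0).

    led1 is the XOR reduction of the upper half, led0 that of the lower half.
    The width and the LED order are assumptions made for this sketch.
    """
    half = width // 2
    return xor_reduce(result >> half, half), xor_reduce(result, half)


print(result_to_leds(0x0100))  # one bit set in the upper half only
print(result_to_leds(0x00FE))  # seven bits set in the lower half only
```

Each LED therefore shows the parity of one half of the result, so every result bit influences the outputs and none of the ALU logic is left unobserved.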

The counters work by counting how many of their input bits change each cycle.
We instantiate two counters, one for the adder and one for the multiplier part of the circuit.
The changes are accumulated in an internal register until a fixed number of cycles has been reached.
The count then represents the activity factor $\alpha_{i}$, not yet normalized to a value in the range $[0, 1]$.
If the total number of cycles is some power of two $2^k$, and the count is some value $s_{i}$ between $0$ and $2^k - 1$, then the activity factor for this window of time satisfies $\alpha_{i} \cdot 2^k = s_{i}$.
The actual activity factor can therefore be recovered by dividing the count by the fixed number of cycles in a window.
Because we choose powers of $2$, this division is a simple shift operation.
In essence, because valid activity data is produced only once every fixed number of cycles, after which all counters are reset, we know for every valid activity value how much time has passed since accumulation started and can normalize without expensive divisions in hardware.
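The windowed counting described above can be sketched as a small behavioural model in Python. This is not the hardware implementation; the class name `ActivityCounter` and its interface are hypothetical, and the model simply counts changed bits (the Hamming distance between consecutive inputs) over a window of $2^k$ cycles.

```python
class ActivityCounter:
    """Behavioural model: count changed input bits over windows of 2**k cycles."""

    def __init__(self, k=8):
        self.k = k
        self.window = 1 << k
        self.prev = 0      # input value seen in the previous cycle
        self.count = 0     # bit changes accumulated in the current window
        self.cycles = 0    # cycles elapsed in the current window

    def clock(self, value):
        """Advance one cycle; return the raw count s when a window ends, else None."""
        self.count += bin(self.prev ^ value).count("1")
        self.prev = value
        self.cycles += 1
        if self.cycles == self.window:
            s = self.count
            self.count = 0
            self.cycles = 0
            return s
        return None


counter = ActivityCounter(k=8)
raw = None
for n in range(1, 257):
    # Hypothetical 1-bit input that changes every second cycle.
    raw = counter.clock((n // 2) & 1)

alpha = raw / (1 << counter.k)  # in hardware this division is a shift by k
print(f"s = {raw}, alpha = {alpha}")
```

Because the window length is known to be $2^k$, the model never needs a general divider, which mirrors the reasoning in the paragraph above. The sketch does not clamp the count, so a multi-bit input that changes many bits per cycle would produce values above $2^k - 1$.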

The task of the estimator is to collect multiple activity factor values, add them together and scale them by the coefficients of the dynamic power formula.
If we assume $C_{i}$ to be the same for all subsystems, $\forall i : C_{i} = C$, and if we consider the missing normalization of the activity factor, then we can re-express the formula that is calculated as follows:

$$
\begin{aligned}
P_{dynamic} &= C V_{dd}^2 f \sum_{i \in \{\text{add}, \text{mul}\}} \alpha_{i} \\
&= C V_{dd}^2 f \sum_{i \in \{\text{add}, \text{mul}\}} 2^{-k} s_{i} \\
&= 2^{-k} C V_{dd}^2 f \sum_{i \in \{\text{add}, \text{mul}\}} s_{i} \\
\end{aligned}
$$

This is how the system currently calculates the dynamic power, using $k = 8$ (windows of 256 cycles).
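The last line of the derivation can be evaluated with integer arithmetic only. The sketch below computes $2^{-k} f \sum_i s_{i}$ with $C = 1 F$ and $V_{dd} = 1 V$ left out, for $f = 100 MHz$ and $k = 8$. The function name `raw_estimate` and the example counts are hypothetical, and whether the hardware groups the operations in exactly this way is an assumption.

```python
F_CLK_HZ = 100_000_000  # clock frequency used in this report
K = 8                   # window length is 2**K = 256 cycles


def raw_estimate(counts, f_hz=F_CLK_HZ, k=K):
    """Return 2^-k * f * sum(s_i), leaving out C and Vdd^2 (C = 1 F, Vdd = 1 V)."""
    return (f_hz * sum(counts.values())) >> k


# Hypothetical window counts for the adder and the multiplier.
estimate = raw_estimate({"add": 40, "mul": 88})
print(estimate)
```

For these parameters $2^{-k} f = 100\,000\,000 / 256 = 390625$ is an integer, so the shift does not lose any precision.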

To apply the constant coefficient $2^{-k} C V_{dd}^2 f$ efficiently, we calculate its value by hand and approximate it by the nearest power of 2, which tells us by how much the result needs to be shifted.

It turns out that the total scaling is $2^{-34}$, so all reported estimate values must be scaled by $2^{-34}$ (a right shift by 34 bits) in order to be valid.
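The scaling by $2^{-34}$ can be illustrated with a short Python sketch. The helper names are hypothetical; the input is the raw estimate value $47265625$ that is captured in the simulation section of this report. The second helper shows one possible integer-only variant that reports microwatts, which is an assumption and not necessarily what the hardware does.

```python
SHIFT = 34  # total scaling stated in this report: 2**-34


def estimate_to_watts(raw, shift=SHIFT):
    """Scale a raw estimate by 2**-shift and return the power in watts."""
    return raw / (1 << shift)


def estimate_to_microwatts(raw, shift=SHIFT):
    """Integer-only variant: multiply first, then shift, to keep precision."""
    return (raw * 1_000_000) >> shift


raw = 47265625
print(f"{estimate_to_watts(raw):.5f} W")
print(f"{estimate_to_microwatts(raw)} uW")
```

Multiplying before shifting matters in the integer variant, because shifting the raw value right by 34 bits directly would truncate it to zero.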

\newpage

#### Vivado power report

The power reports generated for the design are listed here.

![Vivado power report total summary\label{fig:power-report-total}](graphics/power-report-total-summary.png)

Figure \ref{fig:power-report-total} shows the overall summary of the power report that has been generated.

![Vivado power report dynamic hierarchy\label{fig:power-report-dynamic}](graphics/power-report-dynamic-hierarchy.png)

Figure \ref{fig:power-report-dynamic} shows the dynamic power utilization split between leaf cells and the instantiated ALU module.
As can be seen, the estimated dynamic power consumption of the ALU is $0.001 W$.

#### Vivado utilization report

We also include a (textual) resource utilization report here:

```
Copyright 1986-2021 Xilinx, Inc. All Rights Reserved.
-------------------------------------------------------------------------------------------------
| Tool Version : Vivado v.2021.2 (lin64) Build 3367213 Tue Oct 19 02:47:39 MDT 2021
| Date         : Mon Dec 18 21:21:55 2023
| Host         : ayaya running 64-bit Void Linux
| Command      : report_utilization -file top_utilization_synth.rpt -pb top_utilization_synth.pb
| Design       : top
| Device       : xc7z020clg484-1
| Speed File   : -1
| Design State : Synthesized
-------------------------------------------------------------------------------------------------

Utilization Design Information

Table of Contents
-----------------
1. Slice Logic
1.1 Summary of Registers by Type
2. Memory
3. DSP
4. IO and GT Specific
5. Clocking
6. Specific Feature
7. Primitives
8. Black Boxes
9. Instantiated Netlists

1. Slice Logic
--------------

+-------------------------+------+-------+------------+-----------+-------+
|        Site Type        | Used | Fixed | Prohibited | Available | Util% |
+-------------------------+------+-------+------------+-----------+-------+
| Slice LUTs*             |  119 |     0 |          0 |     53200 |  0.22 |
|   LUT as Logic          |  119 |     0 |          0 |     53200 |  0.22 |
|   LUT as Memory         |    0 |     0 |          0 |     17400 |  0.00 |
| Slice Registers         |   63 |     0 |          0 |    106400 |  0.06 |
|   Register as Flip Flop |   63 |     0 |          0 |    106400 |  0.06 |
|   Register as Latch     |    0 |     0 |          0 |    106400 |  0.00 |
| F7 Muxes                |    0 |     0 |          0 |     26600 |  0.00 |
| F8 Muxes                |    0 |     0 |          0 |     13300 |  0.00 |
+-------------------------+------+-------+------------+-----------+-------+
* Warning! The Final LUT count, after physical optimizations and full implementation, is typically lower. Run opt_design after synthesis, if not already completed, for a more realistic count.


1.1 Summary of Registers by Type
--------------------------------

+-------+--------------+-------------+--------------+
| Total | Clock Enable | Synchronous | Asynchronous |
+-------+--------------+-------------+--------------+
| 0     |            _ |           - |            - |
| 0     |            _ |           - |          Set |
| 0     |            _ |           - |        Reset |
| 0     |            _ |         Set |            - |
| 0     |            _ |       Reset |            - |
| 0     |          Yes |           - |            - |
| 0     |          Yes |           - |          Set |
| 0     |          Yes |           - |        Reset |
| 0     |          Yes |         Set |            - |
| 63    |          Yes |       Reset |            - |
+-------+--------------+-------------+--------------+


2. Memory
---------

+----------------+------+-------+------------+-----------+-------+
|    Site Type   | Used | Fixed | Prohibited | Available | Util% |
+----------------+------+-------+------------+-----------+-------+
| Block RAM Tile |    0 |     0 |          0 |       140 |  0.00 |
|   RAMB36/FIFO* |    0 |     0 |          0 |       140 |  0.00 |
|   RAMB18       |    0 |     0 |          0 |       280 |  0.00 |
+----------------+------+-------+------------+-----------+-------+
* Note: Each Block RAM Tile only has one FIFO logic available and therefore can accommodate only one FIFO36E1 or one FIFO18E1. However, if a FIFO18E1 occupies a Block RAM Tile, that tile can still accommodate a RAMB18E1


3. DSP
------

+-----------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+-----------+------+-------+------------+-----------+-------+
| DSPs      |    0 |     0 |          0 |       220 |  0.00 |
+-----------+------+-------+------------+-----------+-------+


4. IO and GT Specific
---------------------

+-----------------------------+------+-------+------------+-----------+-------+
|          Site Type          | Used | Fixed | Prohibited | Available | Util% |
+-----------------------------+------+-------+------------+-----------+-------+
| Bonded IOB                  |    4 |     0 |          0 |       200 |  2.00 |
| Bonded IPADs                |    0 |     0 |          0 |         2 |  0.00 |
| Bonded IOPADs               |    0 |     0 |          0 |       130 |  0.00 |
| PHY_CONTROL                 |    0 |     0 |          0 |         4 |  0.00 |
| PHASER_REF                  |    0 |     0 |          0 |         4 |  0.00 |
| OUT_FIFO                    |    0 |     0 |          0 |        16 |  0.00 |
| IN_FIFO                     |    0 |     0 |          0 |        16 |  0.00 |
| IDELAYCTRL                  |    0 |     0 |          0 |         4 |  0.00 |
| IBUFDS                      |    0 |     0 |          0 |       192 |  0.00 |
| PHASER_OUT/PHASER_OUT_PHY   |    0 |     0 |          0 |        16 |  0.00 |
| PHASER_IN/PHASER_IN_PHY     |    0 |     0 |          0 |        16 |  0.00 |
| IDELAYE2/IDELAYE2_FINEDELAY |    0 |     0 |          0 |       200 |  0.00 |
| ILOGIC                      |    0 |     0 |          0 |       200 |  0.00 |
| OLOGIC                      |    0 |     0 |          0 |       200 |  0.00 |
+-----------------------------+------+-------+------------+-----------+-------+


5. Clocking
-----------

+------------+------+-------+------------+-----------+-------+
|  Site Type | Used | Fixed | Prohibited | Available | Util% |
+------------+------+-------+------------+-----------+-------+
| BUFGCTRL   |    1 |     0 |          0 |        32 |  3.13 |
| BUFIO      |    0 |     0 |          0 |        16 |  0.00 |
| MMCME2_ADV |    0 |     0 |          0 |         4 |  0.00 |
| PLLE2_ADV  |    0 |     0 |          0 |         4 |  0.00 |
| BUFMRCE    |    0 |     0 |          0 |         8 |  0.00 |
| BUFHCE     |    0 |     0 |          0 |        72 |  0.00 |
| BUFR       |    0 |     0 |          0 |        16 |  0.00 |
+------------+------+-------+------------+-----------+-------+


6. Specific Feature
-------------------

+-------------+------+-------+------------+-----------+-------+
|  Site Type  | Used | Fixed | Prohibited | Available | Util% |
+-------------+------+-------+------------+-----------+-------+
| BSCANE2     |    0 |     0 |          0 |         4 |  0.00 |
| CAPTUREE2   |    0 |     0 |          0 |         1 |  0.00 |
| DNA_PORT    |    0 |     0 |          0 |         1 |  0.00 |
| EFUSE_USR   |    0 |     0 |          0 |         1 |  0.00 |
| FRAME_ECCE2 |    0 |     0 |          0 |         1 |  0.00 |
| ICAPE2      |    0 |     0 |          0 |         2 |  0.00 |
| STARTUPE2   |    0 |     0 |          0 |         1 |  0.00 |
| XADC        |    0 |     0 |          0 |         1 |  0.00 |
+-------------+------+-------+------------+-----------+-------+


7. Primitives
-------------

+----------+------+---------------------+
| Ref Name | Used | Functional Category |
+----------+------+---------------------+
| FDRE     |   63 |        Flop & Latch |
| LUT6     |   45 |                 LUT |
| LUT4     |   34 |                 LUT |
| LUT2     |   33 |                 LUT |
| LUT5     |   24 |                 LUT |
| CARRY4   |   20 |          CarryLogic |
| LUT3     |   13 |                 LUT |
| OBUF     |    2 |                  IO |
| LUT1     |    2 |                 LUT |
| IBUF     |    2 |                  IO |
| BUFG     |    1 |               Clock |
+----------+------+---------------------+


8. Black Boxes
--------------

+-------------+------+
|   Ref Name  | Used |
+-------------+------+
| mul_const_f |    1 |
+-------------+------+


9. Instantiated Netlists
------------------------

+----------+------+
| Ref Name | Used |
+----------+------+

```

#### Simulations

For the simulation, a basic testbench for the top unit was used.
To obtain preliminary values for estimating the capacitance, the testbench was written so that it collects 1000 estimate values from our system:

```verilog
// top_tb.v
`timescale 1ns/1ns

module top_tb;

reg clk;
reg reset;
wire [1:0] led;

integer f;
integer estimates;

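initial begin
	estimates = 0;
	f = $fopen("estimates.txt", "w");

	$dumpfile("top.vcd");
	// dump the instantiated design (the instance is named top_inst)
	$dumpvars(0, top_inst);

	// hold reset for 200 ns, then release it
	reset <= 1'b1;
	#200;
	reset <= 1'b0;

	@(posedge clk);
	// log every valid estimate until 1000 have been collected
	while (estimates < 1000) begin
		@(posedge clk);
		if (top_inst.estimate_valid == 1'b1) begin
			estimates = estimates + 1;
			// $display appends a newline by itself
			$display("@%2d: estimate is %1d", $time, top_inst.estimate);
			$fwrite(f, "@%2d: estimate is %1d\n", $time, top_inst.estimate);
		end
	end

	$fclose(f);
	$display("test complete");
	$finish;
end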

initial begin
	clk <= 1'b0;
	// toggle every 5 ns: 10 ns period, i.e. a clock of 100 MHz
	forever #5 clk <= ~clk;
end

top top_inst (
	.clk(clk),
	.reset(reset),
	.led(led)
);
endmodule
```

These 1000 estimates are written to a file `estimates.txt`, which is then processed by a Python 3 script that calculates the average of the estimates:

```python
#!/usr/bin/env python3
# Each line looks like "@<time>: estimate is <value>"; keep the trailing value.
with open('estimates.txt', 'r') as f:
    values = [int(line.split()[-1]) for line in f if line.strip()]
average = sum(values) / len(values)
print(f'average is {average}')
```

This yields the output:

```
average is 72259765.625
```

We can then use this as a basis for estimating the capacitance $C$.
The dynamic power consumption of the ALU reported by Vivado is $0.001 W$.
This gives us a scaling factor of $0.001 / 72259765.625 \approx 1.38 \cdot 10^{-11}$, that is, a value of about $13.8 pF$ for $C$.
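The calibration step amounts to a single division, shown here as a small Python sketch. The function name `calibrate_capacitance` is hypothetical; the two inputs are the reported ALU dynamic power and the average raw estimate given above.

```python
def calibrate_capacitance(reported_power_w, mean_raw_estimate):
    """Return the C (in farads) that maps the mean raw estimate onto the reported power."""
    if mean_raw_estimate <= 0:
        raise ValueError("mean raw estimate must be positive")
    return reported_power_w / mean_raw_estimate


c = calibrate_capacitance(0.001, 72259765.625)
print(f"C = {c * 1e12:.1f} pF")  # prints: C = 13.8 pF
```

Note that the power report lists the ALU power with only three decimal places, so the calibrated capacitance cannot be more precise than that input.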

Because our model uses values generated in the power report, our values are only correct insofar as the environmental assumptions of the power report are met.
This is why we also list those assumptions here:

```
+-----------------------+------------------------+
| Ambient Temp (C)      | 25.0                   |
| ThetaJA (C/W)         | 11.5                   |
| Airflow (LFM)         | 250                    |
| Heat Sink             | none                   |
| ThetaSA (C/W)         | 0.0                    |
| Board Selection       | medium (10"x10")       |
| # of Board Layers     | 8to11 (8 to 11 Layers) |
| Board Temperature (C) | 25.0                   |
+-----------------------+------------------------+
```

Finally, some simulation screenshots illustrate the operation of the system.

![Coarse overview of simulation results\label{fig:sim-overview}](graphics/simulation-overview.png)

Figure \ref{fig:sim-overview} shows a coarse overview with a lot of switching activity going on.
As can be seen, every once in a while the `estimate_valid` signal goes high, at which point a power estimate value is captured.

![Simulation capturing a specific estimate value\label{fig:sim-estimate}](graphics/simulation-estimate.png)

Figure \ref{fig:sim-estimate} shows such a capture, here of the value $0x2d13759 = 47265625$.
If we normalize this as previously described (scaling by $2^{-34}$), we get a dynamic power estimate of $P_{dynamic} \approx 0.00275 W$ for our small design.
Note that some error is to be expected, because we used approximations in order to avoid division operations in our design.

#### Results

Due to time constraints and the lack of an available FPGA, the system was only tested in simulation.
To realize a full system test, it would be necessary to integrate the UART Lite IP core properly and to test everything thoroughly on hardware.
